External sorting

External sorting is a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and must instead reside in the slower external memory, usually a disk drive. External sorting algorithms are therefore external memory algorithms, applicable in the external memory model of computation.

External sorting algorithms generally fall into two types: distribution sorting, which resembles quicksort, and external merge sort, which resembles merge sort. The latter typically uses a hybrid sort-merge strategy. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file. In the merge phase, the sorted subfiles are combined into a single larger file.


Model

External sorting algorithms can be analyzed in the external memory model. In this model, a cache or internal memory of size M and an unbounded external memory are divided into blocks of size B, and the running time of an algorithm is determined by the number of memory transfers between internal and external memory. Like their cache-oblivious counterparts, asymptotically optimal external sorting algorithms achieve a running time (in big O notation) of O\left(\tfrac{N}{B}\log_{M/B} \tfrac{N}{B}\right), where N is the number of items being sorted.
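For a sense of scale, here is a worked instance of the bound with illustrative numbers (not from the article): take N = 10^9 records, an internal memory holding M = 10^7 records, and blocks of B = 10^4 records. Then

\[
\frac{N}{B}\log_{M/B}\frac{N}{B} \;=\; 10^{5}\cdot\log_{10^{3}}10^{5} \;=\; 10^{5}\cdot\frac{5}{3} \;\approx\; 1.7\times10^{5}
\]

block transfers, i.e. on the order of a couple of sequential passes over the data (a single pass costs N/B = 10^5 transfers), rather than the per-element cost an internal-memory analysis would count.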


External merge sort

One example of external sorting is the external merge sort algorithm, which is a k-way merge algorithm. It sorts chunks that each fit in RAM, then merges the sorted chunks together. The algorithm first sorts M items at a time and puts the sorted lists back into external memory. It then recursively does an \tfrac{M}{B}-way merge on those sorted lists. To do this merge, B elements from each sorted list are loaded into internal memory, and the minimum is repeatedly output.

For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:
# Read 100 MB of the data into main memory and sort it by some conventional method, like quicksort.
# Write the sorted data to disk.
# Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900 MB / 100 MB = 9 chunks), which now need to be merged into one single output file.
# Read the first 10 MB (= 100 MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
# Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally: because the merge algorithm only makes one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.

Historically, instead of a sort, sometimes a replacement-selection algorithm was used to perform the initial distribution, to produce on average half as many output chunks of double the length.
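The following is a minimal sketch of this scheme in Python; it is not from the source article, and the file format (one integer per line) and the chunk_size parameter are illustrative assumptions. It sorts runs in memory, writes each to a temporary file, then k-way merges the runs with heapq.merge, which streams each run lazily instead of loading it whole:

```python
import heapq
import os
import tempfile

def external_merge_sort(input_path, output_path, chunk_size=100_000):
    """Sort a file of one integer per line using bounded memory.

    chunk_size is the number of records sorted in RAM at a time
    (it plays the role of M in the analysis above).
    """
    run_paths = []
    # Sorting phase: read chunk_size records, sort in memory, write a run.
    with open(input_path) as f:
        while True:
            chunk = [int(line) for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{x}\n" for x in chunk)
            run_paths.append(path)

    # Merge phase: k-way merge of the sorted runs.
    # heapq.merge pulls one item at a time from each run, so only
    # buffered prefixes of the runs are in memory, never whole runs.
    runs = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            merged = heapq.merge(*((int(line) for line in r) for r in runs))
            out.writelines(f"{x}\n" for x in merged)
    finally:
        for r in runs:
            r.close()
        for p in run_paths:
            os.remove(p)
```

A production-quality sort would additionally bound the merge fan-in and size the per-run input buffers explicitly, as in the 9-way example above; here those buffers are hidden behind ordinary file buffering.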


Additional passes

The previous example is a two-pass sort: first sort, then merge. The sort ends with a single ''k''-way merge, rather than a series of two-way merge passes as in a typical in-memory merge sort. This is because each merge pass reads and writes ''every value'' from and to disk, so reducing the number of passes more than compensates for the additional cost of a ''k''-way merge. The limitation of single-pass merging is that as the number of chunks increases, memory is divided into more buffers, so each buffer is smaller. Eventually, the reads become so small that more time is spent on disk seeks than on data transfer. A typical magnetic hard disk drive might have a 10 ms access time and a 100 MB/s data transfer rate, so each seek takes as much time as transferring 1 MB of data.

Thus, for sorting, say, 50 GB in 100 MB of RAM, using a single 500-way merge pass isn't efficient: we can only read 100 MB / 501 ≈ 200 KB from each chunk at once, so 5/6 of the disk's time is spent seeking. Using two merge passes solves the problem. Then the sorting process might look like this:
# Run the initial chunk-sorting pass as before to create 500 × 100 MB sorted chunks.
# Run a first merge pass combining 25 × 100 MB chunks at a time, resulting in 20 × 2.5 GB sorted chunks.
# Run a second merge pass to merge the 20 × 2.5 GB sorted chunks into a single 50 GB sorted result.
Although this requires an additional pass over the data, each read is now 4 MB long, so only 1/5 of the disk's time is spent seeking. The improvement in data transfer efficiency during the merge passes (from 16.7% to 80%, almost a 5× improvement) more than makes up for the doubled number of merge passes; a worked calculation follows below.

Variations include using an intermediate medium like a solid-state drive (SSD) for some stages, even if there isn't enough of it to hold the full dataset. Repeating the example above with 20 GB of temporary storage on SSD, the first pass could merge 200 × 100 MB sorted chunks read from that temporary space to write 20 GB sorted chunks to HDD. A 200-way merge with 500 KB buffers is reasonably efficient due to the much greater random-read throughput of SSDs, the first pass can benefit from an SSD's higher read/write bandwidth, and the second pass has fewer, larger chunks to merge and therefore fewer HDD seeks. SSDs can also be used as read buffers in a merge phase, allowing fewer, larger reads from HDD storage. Given the relatively low cost of SSD space, it can be an economical tool for sorting large inputs with very limited memory.

Like in-memory sorts, efficient external sorts require O(''n'' log ''n'') time: linear increases in data size require logarithmic increases in the number of passes, and each pass takes a linear number of reads and writes. With the large memory sizes provided by modern computers, the logarithmic factor grows very slowly. Under reasonable assumptions, at least 500 GB of data stored on a hard drive can be sorted using 1 GB of main memory before a third pass becomes advantageous, and many times that much data can be sorted before a fourth pass becomes useful. Media with high random-read performance, like solid-state drives (SSDs), also increase the amount that can be sorted before additional passes improve performance.

Main memory size is important. Doubling the memory dedicated to sorting halves the number of chunks ''and'' the number of reads per chunk, reducing the number of seeks required by about three-quarters. The ratio of RAM to disk storage on servers often makes it convenient to do huge sorts on a cluster of machines rather than on one machine with multiple passes.
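As a sketch of the arithmetic behind the two-pass example's figures (using its assumed 10 ms seek time and 100 MB/s transfer rate): a read of r bytes costs one seek plus r divided by the transfer rate, so the fraction of disk time spent actually transferring data is

\[
\frac{r/\text{rate}}{t_\text{seek} + r/\text{rate}}:\qquad
\frac{2\ \text{ms}}{10\ \text{ms} + 2\ \text{ms}} \approx 16.7\%\ \text{for 200 KB reads},\qquad
\frac{40\ \text{ms}}{10\ \text{ms} + 40\ \text{ms}} = 80\%\ \text{for 4 MB reads}.
\]

Total merge time scales as the number of passes divided by the transfer fraction: one 500-way pass costs about 1/0.167 ≈ 6 data-volumes of disk time, while the two-pass scheme costs about 2/0.8 = 2.5, roughly a 2.4× overall win despite touching the data twice.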


External distribution sort

External distribution sort is analogous to quicksort. The algorithm finds approximately \tfrac{M}{B} pivots and uses them to divide the N elements into approximately equally sized subarrays, each of whose elements are all smaller than the next, and then recurses until the sizes of the subarrays are less than the block size. When the subarrays are smaller than the block size, sorting can be done quickly because all reads and writes are done in the cache, and in the external memory model it requires O(1) operations per subarray. However, finding exactly \tfrac{M}{B} pivots would not be fast enough to make the external distribution sort asymptotically optimal. Instead, we find slightly fewer pivots. To find these pivots, the algorithm splits the N input elements into \tfrac{N}{M} chunks, takes every \sqrt{\tfrac{M}{B}} elements from each chunk, and recursively uses the median of medians algorithm to find \sqrt{\tfrac{M}{B}} pivots. There is a duality, or fundamental similarity, between merge- and distribution-based algorithms.
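A minimal Python sketch of the distribution idea follows; it is illustrative rather than the article's algorithm: pivot selection here is simple random sampling instead of the recursive median-of-medians step described above, and buckets are in-memory lists where a real external sort would write each bucket sequentially to its own file.

```python
import bisect
import random

def distribution_pass(records, pivots):
    """Partition a stream of keys into len(pivots)+1 ordered buckets."""
    buckets = [[] for _ in range(len(pivots) + 1)]
    for key in records:
        # bisect_right finds which pivot interval the key falls into,
        # so every key in bucket i is <= every key in bucket i+1.
        buckets[bisect.bisect_right(pivots, key)].append(key)
    return buckets

def external_distribution_sort(records, fit_in_memory=1024, fanout=16):
    """Recursively distribute until a bucket fits 'in memory', then sort it."""
    records = list(records)
    if len(records) <= fit_in_memory:
        records.sort()
        return records
    # Illustrative pivot choice: a random sample stands in for the
    # sparse-sample median-of-medians selection in the text.
    pivots = sorted(random.sample(records, fanout - 1))
    out = []
    for bucket in distribution_pass(records, pivots):
        if len(bucket) == len(records):
            # Degenerate split (e.g. all keys equal): sort directly.
            bucket.sort()
            out.extend(bucket)
        else:
            out.extend(external_distribution_sort(bucket, fit_in_memory, fanout))
    return out
```

For example, external_distribution_sort(range(10_000, 0, -1)) returns the keys in ascending order. With on-disk buckets and the pivot counts given above, each level of recursion reads and writes the data once, yielding the same O\left(\tfrac{N}{B}\log_{M/B}\tfrac{N}{B}\right) transfer bound as the merge-based approach.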


Performance

The Sort Benchmark, created by computer scientist Jim Gray, compares external sorting algorithms implemented using finely tuned hardware and software. Winning implementations use several techniques:

* Using parallelism
** Multiple disk drives can be used in parallel in order to improve sequential read and write speed. This can be a very cost-efficient improvement: a Sort Benchmark winner in the cost-centric Penny Sort category uses six hard drives in an otherwise midrange machine.
** Sorting software can use multiple threads to speed up the process on modern multicore computers.
** Software can use asynchronous I/O so that one run of data can be sorted or merged while other runs are being read from or written to disk.
** Multiple machines connected by fast network links can each sort part of a huge dataset in parallel (Rasmussen et al., TritonSort).
* Increasing hardware speed
** Using more RAM for sorting can reduce the number of disk seeks and avoid the need for more passes.
** Fast external memory like solid-state drives (SSDs) can speed sorts, either if the data is small enough to fit entirely on SSDs or, more rarely, to accelerate sorting SSD-sized chunks in a three-pass sort.
** ''Many'' other factors can affect hardware's maximum sorting speed: CPU speed and number of cores, RAM access latency, input/output bandwidth, disk read/write speed, disk seek time, and others. "Balancing" the hardware to minimize bottlenecks is an important part of designing an efficient sorting system.
** Cost-efficiency as well as absolute speed can be critical, especially in cluster environments where lower node costs allow purchasing more nodes.
* Increasing software speed
** Some Sort Benchmark entrants use a variation on radix sort for the first phase of sorting: they separate data into one of many "bins" based on the beginning of its value. Sort Benchmark data is random and especially well-suited to this optimization.
** Compacting the input, intermediate files, and output can reduce time spent on I/O, but is not allowed in the Sort Benchmark.
** Because the Sort Benchmark sorts long (100-byte) records using short (10-byte) keys, sorting software sometimes rearranges the keys separately from the values to reduce memory I/O volume (a sketch of this technique follows the list).
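A hedged Python sketch of that last key-value separation technique (the record layout and sizes are illustrative assumptions mirroring the benchmark's 100-byte records with 10-byte keys): instead of moving whole records during the sort, build short (key, index) pairs, sort those, and copy each record once into its final position.

```python
def sort_by_key_prefix(records, key_len=10):
    """Sort fixed-size byte records by their first key_len bytes.

    Only the short (key, index) pairs move during sorting; each
    full record is copied exactly once, into its final position.
    """
    tagged = sorted((rec[:key_len], i) for i, rec in enumerate(records))
    return [records[i] for _, i in tagged]

# Example: 100-byte records whose first 10 bytes are the sort key.
recs = [bytes([k]) * 10 + b"\x00" * 90 for k in (3, 1, 2)]
assert [r[0] for r in sort_by_key_prefix(recs)] == [1, 2, 3]
```

During the in-memory sort phase this cuts the bytes moved per swap from the full record size to roughly the key size plus an index.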


See also

* Mainframe sort merge
* External memory algorithm
* Funnelsort
* Cache-oblivious distribution sort


References

* Rasmussen et al., ''TritonSort''


External links


* STXXL, an algorithm toolkit including external mergesort
* A K-Way Merge Implementation
* External-Memory Sorting in Java
* A sample pennysort implementation using Judy Arrays
* Sort Benchmark